Multi-step texture processing with feedback in texture unit

ABSTRACT

Techniques are described for using a texture unit to perform operations of a shader processor. Some operations of a shader processor are repeatedly executed until a condition is satisfied, and in each execution iteration, the shader processor accesses the texture unit. Techniques are described for the texture unit to perform such operations until the condition is satisfied.

TECHNICAL FIELD

This disclosure relates to graphics processing systems, and more particularly, to graphics processing systems that utilize a texture unit.

BACKGROUND

Computing devices often utilize a graphics processing unit (GPU) to accelerate the rendering of graphics data for display. Such computing devices may include, e.g., computer workstations, mobile phones such as so-called smartphones, embedded systems, personal computers, tablet computers, and video game consoles. GPUs typically execute a graphics processing pipeline that includes a plurality of processing stages which operate together to execute graphics processing commands. A host central processing unit (CPU) may control the operation of the GPU by issuing one or more graphics processing commands to the GPU.

SUMMARY

This disclosure is directed to using a texture unit to implement operations of a shader processor of a graphics processing unit (GPU) to limit calls to the texture unit. Operations that include multiple calls to the texture unit that a shader processor is to perform are instead performed by the texture unit. Each of these operations that the shader processor is to perform may instead be performed by hardware components within the texture unit. In this way, the GPU leverages the hardware of the texture unit to perform operations that the shader processor is to perform and limits calls to the texture unit.

In one example, the disclosure describes an example method of processing data, the method comprising receiving, with a texture unit, an instruction instructing the texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied, repeatedly executing, with the texture unit, the operations based on the condition defined in the instruction being satisfied or not being satisfied, and outputting, with the texture unit and to a graphics processing unit (GPU), data resulting from the repeated execution of the operations.

In one example, the disclosure describes an example device for processing data, the device comprising a graphics processing unit (GPU) comprising a shader processor, and a texture unit configured to receive, from the shader processor of the GPU, an instruction instructing the texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied, repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied, and output, to the GPU, data resulting from the repeated execution of the operations.

In one example, the disclosure describes an example device for processing data, the device comprising means for receiving an instruction instructing a texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied, means for repeatedly executing the operations based on the condition defined in the instruction being satisfied or not being satisfied, and means for outputting, to a graphics processing unit (GPU), data resulting from the repeated execution of the operations.

In one example, the disclosure describes an example non-transitory computer-readable storage medium storing instructions that when executed cause one or more processors of a device for processing data to receive an instruction instructing a texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied, repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied, and output, to a graphics processing unit (GPU), data resulting from the repeated execution of the operations.

The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram illustrating an example computing device that may be used to implement the techniques of this disclosure.

FIG. 2 is a block diagram illustrating the CPU, the GPU and the memory of the computing device of FIG. 1 in further detail.

FIG. 3 is a block diagram illustrating an example of a texture unit of FIG. 2 in further detail.

FIG. 4 is a flowchart illustrating an example method of processing data in accordance with one or more example techniques described in this disclosure.

DETAILED DESCRIPTION

This disclosure is directed to leveraging a texture unit to perform operations that otherwise would require a shader processor of a graphics processing unit (GPU) to issue multiple calls to the texture unit to perform the operations. For various graphics processing algorithms, the shader processor outputs multiple requests for the texture unit to retrieve texture data, process texture data, and output the processed texture data (e.g., texels) to the shader processor.

One common factor in the various graphics processing algorithms that repeatedly causes the shader processor to access the texture unit is a structure of the instructions that execute on the shader processor. For example, the structure of the instructions generally includes a loop with a termination condition, logic to advance/modify texture coordinates, and logic to calculate a result of operations defined in the instructions.

In the techniques described in this disclosure, rather than having the shader processor execute these instructions that include the repeated calls to the texture unit, the instructions may be mapped to be performed by hardware components of the texture unit. For example, the texture unit may include a feedback path where an output of the texture unit feeds back into a component of the texture unit that receives the input. With the feedback, the texture unit may be configured to implement the iterations of the loop, without requiring repeated calls to the texture unit from another unit such as a shader processor. In this way, the shader processor may execute one instruction that causes the shader processor to output a set of data (e.g., the data on which the shader processor was going to perform operations) to the texture unit, and the texture unit then performs the iterations of the instructions in the loop using the internal feedback path, and outputs a result once with the final data (texels) to the shader processor. Accordingly, the shader processor may need to output once to the texture unit, rather than output multiple times with intermediate data, and receive data once from the texture unit, rather than having the shader processor access the texture unit multiple times (e.g., rather than receiving intermediate output multiple times from the texture unit and invoking the texture unit with multiple calls).
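
As a rough illustration of the saving described above, the following C sketch models a texture unit in software and counts shader-to-texture-unit round trips for both arrangements. It is a minimal sketch under stated assumptions, not the disclosed hardware: the TextureUnit structure, the round_trips counter, the one-dimensional texture, and both search functions are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model: a 1D texture of floats plus a counter
 * of shader-to-texture-unit round trips. None of these names come
 * from the disclosure. */
typedef struct {
    const float *texels;
    size_t       count;
    unsigned     round_trips;   /* requests issued by the shader */
} TextureUnit;

/* One ordinary texture fetch: one round trip per call. */
static float tex_fetch(TextureUnit *tu, size_t index)
{
    assert(index < tu->count);
    tu->round_trips++;
    return tu->texels[index];
}

/* Baseline: the shader owns the loop, so it calls the texture unit
 * once per iteration. Returns the first index whose texel is >= limit,
 * or count if no such texel exists. */
static size_t shader_driven_search(TextureUnit *tu, float limit)
{
    size_t i = 0;
    while (i < tu->count && tex_fetch(tu, i) < limit) {
        i++;
    }
    return i;
}

/* Offloaded: the shader issues a single request and the loop runs
 * inside the texture unit model, standing in for the feedback path. */
static size_t texture_unit_search(TextureUnit *tu, float limit)
{
    size_t i = 0;
    tu->round_trips++;          /* the single request from the shader */
    while (i < tu->count && tu->texels[i] < limit) {
        i++;                    /* coordinate advanced internally */
    }
    return i;
}
```

With the texels {0.1, 0.2, 0.3, 0.9, 0.5} and a limit of 0.8, both functions return index 3, but the shader-driven version records four round trips while the offloaded version records one.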

In some examples, the GPU and texture unit may reside in the same integrated circuit or may reside in different integrated circuits. The texture unit may be configured to receive, from a shader processor of the GPU, an instruction instructing the texture unit to repeatedly execute operations (e.g., looped-instructions) based on an occurrence of a condition defined in the instruction (e.g., the termination condition). In response, the texture unit may repeatedly execute the operations until the condition defined in the instruction is satisfied or not satisfied, and output to the GPU data resulting from the repeated execution of the operations.

FIG. 1 is a block diagram illustrating an example computing device 2 that may be used to implement techniques of this disclosure. Computing device 2 may comprise a personal computer, a desktop computer, a laptop computer, a computer workstation, a video game platform or console, a wireless communication device (such as, e.g., a mobile telephone, a cellular telephone, a satellite telephone, and/or a mobile telephone handset), a landline telephone, an Internet telephone, a handheld device such as a portable video game device or a personal digital assistant (PDA), a personal music player, a video player, a display device, a television, a television set-top box, a server, an intermediate network device, a mainframe computer or any other type of device that processes and/or displays graphical data.

As illustrated in the example of FIG. 1, computing device 2 includes a user input interface 4, a CPU 6, a memory controller 8, a system memory 10, a graphics processing unit (GPU) 12, a local memory 14, a display interface 16, a display 18 and bus 20. User input interface 4, CPU 6, memory controller 8, GPU 12 and display interface 16 may communicate with each other using bus 20. Bus 20 may be any of a variety of bus structures, such as a third generation bus (e.g., a HyperTransport bus or an InfiniBand bus), a second generation bus (e.g., an Accelerated Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXtensible Interface (AXI) bus) or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 1 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure.

CPU 6 may comprise a general-purpose or a special-purpose processor that controls operation of computing device 2. A user may provide input to computing device 2 to cause CPU 6 to execute one or more software applications. The software applications that execute on CPU 6 may include, for example, an operating system, a word processor application, an email application, a spreadsheet application, a media player application, a video game application, a graphical user interface application or another program. The user may provide input to computing device 2 via one or more input devices (not shown) such as a keyboard, a mouse, a microphone, a touch pad or another input device that is coupled to computing device 2 via user input interface 4.

The software applications that execute on CPU 6 may include one or more graphics rendering instructions that instruct CPU 6 to cause the rendering of graphics data to display 18. In some examples, the software instructions may conform to a graphics application programming interface (API), such as, e.g., an Open Graphics Library (OpenGL®) API, an Open Graphics Library Embedded Systems (OpenGL ES) API, a Direct3D API, an X3D API, a RenderMan API, a WebGL API, or any other public or proprietary standard graphics API. In order to process the graphics rendering instructions, CPU 6 may issue one or more graphics rendering commands to GPU 12 to cause GPU 12 to perform some or all of the rendering of the graphics data. In some examples, the graphics data to be rendered may include a list of graphics primitives, e.g., points, lines, triangles, quadrilaterals, triangle strips, etc.

Memory controller 8 facilitates the transfer of data going into and out of system memory 10. For example, memory controller 8 may receive memory read and write commands, and service such commands with respect to memory 10 in order to provide memory services for the components in computing device 2. Memory controller 8 is communicatively coupled to system memory 10. Although memory controller 8 is illustrated in the example computing device 2 of FIG. 1 as being a processing module that is separate from both CPU 6 and system memory 10, in other examples, some or all of the functionality of memory controller 8 may be implemented on one or both of CPU 6 and system memory 10.

System memory 10 may store program modules and/or instructions that are accessible for execution by CPU 6 and/or data for use by the programs executing on CPU 6. For example, system memory 10 may store user applications and graphics data associated with the applications. System memory 10 may additionally store information for use by and/or generated by other components of computing device 2. For example, system memory 10 may act as a device memory for GPU 12 and may store data to be operated on by GPU 12 as well as data resulting from operations performed by GPU 12. For example, system memory 10 may store any combination of texture buffers, depth buffers, stencil buffers, vertex buffers, frame buffers, or the like. In addition, system memory 10 may store command streams for processing by GPU 12. System memory 10 may include one or more volatile or non-volatile memories or storage devices, such as, for example, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic data media or optical storage media.

GPU 12 may be configured to perform graphics operations to render one or more graphics primitives to display 18. Thus, when one of the software applications executing on CPU 6 requires graphics processing, CPU 6 may provide graphics commands and graphics data to GPU 12 for rendering to display 18. The graphics commands may include, e.g., drawing commands such as a draw call, GPU state programming commands, memory transfer commands, general-purpose computing commands, kernel execution commands, etc. In some examples, CPU 6 may provide the commands and graphics data to GPU 12 by writing the commands and graphics data to memory 10, which may be accessed by GPU 12. In some examples, GPU 12 may be further configured to perform general-purpose computing for applications executing on CPU 6.

GPU 12 may, in some instances, be built with a highly-parallel structure that provides more efficient processing of vector operations than CPU 6. For example, GPU 12 may include a plurality of processing elements that are configured to operate on multiple vertices or pixels in a parallel manner. The highly parallel nature of GPU 12 may, in some instances, allow GPU 12 to draw graphics images (e.g., GUIs and two-dimensional (2D) and/or three-dimensional (3D) graphics scenes) onto display 18 more quickly than drawing the scenes directly to display 18 using CPU 6. In addition, the highly parallel nature of GPU 12 may allow GPU 12 to process certain types of vector and matrix operations for general-purpose computing applications more quickly than CPU 6.

GPU 12 may, in some instances, be integrated into a motherboard of computing device 2. In other instances, GPU 12 may be present on a graphics card that is installed in a port in the motherboard of computing device 2 or may be otherwise incorporated within a peripheral device configured to interoperate with computing device 2. In further instances, GPU 12 may be located on the same microchip as CPU 6 forming a system on a chip (SoC). GPU 12 and CPU 6 may include one or more processors, such as one or more microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), digital signal processors (DSPs), or other equivalent integrated or discrete logic circuitry.

GPU 12 may be directly coupled to local memory 14. Thus, GPU 12 may read data from and write data to local memory 14 without necessarily using bus 20. In other words, GPU 12 may process data locally using a local storage, instead of off-chip memory. This allows GPU 12 to operate in a more efficient manner by eliminating the need for GPU 12 to read and write data via bus 20, which may experience heavy bus traffic. In some instances, however, GPU 12 may not include a separate cache, but instead utilize system memory 10 via bus 20. Local memory 14 may include one or more volatile or non-volatile memories or storage devices, such as, e.g., random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic data media or optical storage media.

CPU 6 and/or GPU 12 may store rendered image data in a frame buffer that is allocated within system memory 10. Display interface 16 may retrieve the data from the frame buffer and configure display 18 to display the image represented by the rendered image data. In some examples, display interface 16 may include a digital-to-analog converter (DAC) that is configured to convert the digital values retrieved from the frame buffer into an analog signal consumable by display 18. In other examples, display interface 16 may pass the digital values directly to display 18 for processing. Display 18 may include a monitor, a television, a projection device, a liquid crystal display (LCD), a plasma display panel, a light emitting diode (LED) array, a cathode ray tube (CRT) display, electronic paper, a surface-conduction electron-emitter display (SED), a laser television display, a nanocrystal display or another type of display unit. Display 18 may be integrated within computing device 2. For instance, display 18 may be a screen of a mobile telephone handset or a tablet computer. Alternatively, display 18 may be a stand-alone device coupled to computing device 2 via a wired or wireless communications link. For instance, display 18 may be a computer monitor or flat panel display connected to a personal computer via a cable or wireless link.

As described, CPU 6 may offload graphics processing to GPU 12. GPU 12 may in turn perform various graphics processing algorithms to render graphics data. Examples of graphics processing algorithms include parallax occlusion mapping (POM), screen space ray tracing (SSRT), depth of field (DoF) processing, volume rendering, or water or terrain rendering with dynamic height fields. Additional graphics processing algorithms also exist, and the above are merely provided as a few examples.

Part of the graphics processing in example graphics processing algorithms includes texturing. Texturing involves a texture unit that retrieves a bitmap from a texture buffer and overlays the bitmap over graphical objects. In some examples, GPU 12 includes the texture unit; however, the texture unit may be external to GPU 12. In some examples, the texture unit, GPU 12, and CPU 6 may be all part of the same integrated circuit (IC) or microcontroller. In this disclosure, the texture unit is described as being internal to GPU 12.

To perform graphics processing, a shader processor of GPU 12 may execute operations of a shader program. Part of the execution of the operations of the shader program may include repeated access to the texture unit. For example, the graphics processing algorithm implemented by the shader processor may include operations that form the following structure: a loop with an upper bound on the iteration count, a termination condition in the loop, simple logic to advance/modify texture coordinates, and simple logic to calculate the result of the operations.

In performing the operations, for each iteration of the loop, the shader processor may output a request to the texture unit to retrieve data, perform processing on the data, and output the data back to the shader processor. This results in multiple requests to the texture unit, which consumes power, clock cycles, and bandwidth of connection lines within GPU 12 or bandwidth of bus 20 in examples where the texture unit is external to GPU 12.

In the techniques described in this disclosure, operations that are to be executed by the shader processor of GPU 12 are instead executed by the texture unit. For instance, the operations that the shader processor of GPU 12 was to execute include the operations that are to be repeatedly executed until a condition is satisfied (e.g., a loop of operations with an upper bound on the iteration count). The texture unit may instead repeatedly execute these instructions until the condition is satisfied or not satisfied, without the need for repeated requests by the shader processor. For instance, the condition may be a continuation condition that holds until an upper bound is reached (e.g., repeat as long as A<B). In this case, the texture unit repeatedly executes as long as the condition is satisfied. Alternatively, the condition may be a stopping condition (e.g., repeat until A≧B). In this case, the texture unit repeatedly executes as long as the condition is not satisfied.

For example, a texture unit (e.g., one within GPU 12 or external to GPU 12) may receive an instruction instructing the texture unit to repeatedly execute operations based on an occurrence of a condition defined in the instruction (e.g., such as reaching the upper bound of an iteration count). The texture unit may repeatedly execute the operations until the condition defined in the instruction is satisfied or not satisfied, and may output data resulting from the repeated execution of the operations.

As an example, the texture unit may read a texel value (e.g., from a texture buffer) during a first iteration of execution of the operations. The texture unit may determine whether the condition is satisfied or not satisfied based on a comparison of the texel value with a variable defined in the instruction, and determine whether a second iteration of execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied or not satisfied.

The repeated executing of the operations includes the texture unit outputting an output of the texture unit as a feedback signal to an input of the texture unit based on the determination that the second iteration of execution of the operations is needed. The texture unit may output to the GPU the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.
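
The read, compare, and feed-back-or-output decision described above can be sketched in C as a single pass function plus a small driver. This is only a sketch under stated assumptions: the LoopState and LoopInstr structures, their fields, the one-dimensional texture, and the choice of a less-than comparison are hypothetical and are not taken from the disclosure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state that circulates on the feedback path. */
typedef struct {
    size_t   coord;       /* current texture coordinate (1D here) */
    unsigned iteration;   /* iterations completed so far */
    float    texel;       /* texel read in the latest iteration */
} LoopState;

/* Hypothetical contents of the looping instruction. */
typedef struct {
    float    threshold;   /* variable the texel is compared against */
    size_t   step;        /* coordinate advance per iteration */
    unsigned max_iters;   /* upper bound on the iteration count */
} LoopInstr;

/* One pass through the unit: read a texel, then decide whether another
 * iteration is needed. Returns true when the state is to be fed back. */
static bool texture_pass(const float *texels, size_t count,
                         const LoopInstr *instr, LoopState *s)
{
    assert(s->coord < count);
    s->texel = texels[s->coord];
    s->iteration++;

    bool condition = s->texel < instr->threshold;   /* keep looping? */
    bool room = s->iteration < instr->max_iters &&
                s->coord + instr->step < count;
    if (condition && room) {
        s->coord += instr->step;   /* modified coordinate for next pass */
        return true;               /* feed back to the input */
    }
    return false;                  /* final result goes to the shader */
}

/* Drives the feedback path until a pass reports that no further
 * iteration is needed, then returns the final state once. */
static LoopState texture_execute(const float *texels, size_t count,
                                 const LoopInstr *instr, size_t start)
{
    LoopState s = { start, 0u, 0.0f };
    while (texture_pass(texels, count, instr, &s)) {
        /* state re-enters the unit via the feedback path */
    }
    return s;
}
```

In this model only texture_execute is visible to the caller, which mirrors the idea that intermediate texels stay inside the texture unit and a single result is output once no further iteration is needed.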

In this way, the workload of the shader processor may be reduced, as compared to examples where the shader processor has to execute the operations in the loop, because the shader processor may be able to issue one instruction to the texture unit and have the texture unit repeatedly execute the operations. This may also result in less shader code to be stored in an instruction cache of GPU 12 since all of the operations that formed the loop could be represented as a single instruction to the texture unit.

There may be a reduction in power usage and an increase in processing efficiency as well. Because the shader processor executes fewer operations, the shader processor may consume less power. The processing hardware units of the texture unit (e.g., arithmetic logic units (ALUs)) may be more power efficient as compared to the shader processor, and therefore, by shifting the execution of the operations to the texture unit, there may be an overall reduction in power. The processing hardware units of the texture unit may also provide higher throughput than the shader processor, resulting in faster processing of the operations than in cases where the shader processor repeatedly executes the operations (e.g., the texture unit is not idle waiting on instructions from the shader processor and does not need to waste clock cycles repeatedly outputting to the shader processor).

The size of a general purpose register (GPR) of the shader processor may also be reduced as compared to examples where the shader processor repeatedly executes the operations of the loop until the condition of the loop is satisfied. The GPR is a register that the shader processor uses to temporarily store data resulting from execution of an operation. If the shader processor were to repeatedly execute the operations, the shader processor would store resulting data for each operation in the GPR and require a relatively large GPR to store data resulting from each iteration of execution. With the example techniques described in this disclosure, the texture unit would store any intermediate data resulting from an iteration of execution, allowing the GPR of the shader processor to be used for other purposes or for the size of the GPR to be reduced.

FIG. 2 is a block diagram illustrating CPU 6, GPU 12 and memory 10 of computing device 2 of FIG. 1 in further detail. As shown in FIG. 2, CPU 6 is communicatively coupled to GPU 12 and memory 10, and GPU 12 is communicatively coupled to CPU 6 and memory 10. GPU 12 may, in some examples, be integrated onto a motherboard with CPU 6. In additional examples, GPU 12 may be implemented on a graphics card that is installed in a port of a motherboard that includes CPU 6. In further examples, GPU 12 may be incorporated within a peripheral device that is configured to interoperate with CPU 6. In additional examples, GPU 12 may be located on the same integrated circuit or microprocessor as CPU 6 forming a system on a chip (SoC). CPU 6 is configured to execute software application (App) 24, a graphics API 26, a GPU driver 28 and an operating system 30.

GPU 12 includes a controller 32, texture unit 34, shader processor 36, one or more fixed function units 38, and local memory 14. In FIG. 2, local memory 14 and texture unit 34 are illustrated as being internal to GPU 12, but local memory 14 and texture unit 34 may be external to GPU 12 as well.

Software application 24 may include at least one of one or more instructions that cause graphic content to be displayed or one or more instructions that cause a non-graphics task (e.g., a general-purpose computing task) to be performed on GPU 12. Software application 24 may issue instructions to graphics API 26. Graphics API 26 may be a runtime service that translates the instructions received from software application 24 into a format that is consumable by GPU driver 28.

GPU driver 28 receives the instructions from software application 24 via graphics API 26, and controls the operation of GPU 12 to service the instructions. For example, GPU driver 28 may formulate one or more command streams, place the command streams into memory 10, and instruct GPU 12 to execute command streams. GPU driver 28 may place the command streams into memory 10 and communicate with GPU 12 via operating system 30, e.g., via one or more system calls.

Controller 32 may be hardware of GPU 12, may be software or firmware executing on GPU 12, or a combination of both. Controller 32 may control the operations of the various components of GPU 12. For example, controller 32 may control when instructions and data are provided to the components, control the reception of instructions and data, and control the output of data from GPU 12.

Shader processor 36 and fixed function units 38 together provide graphics processing stages that form a graphics processing pipeline via which GPU 12 performs graphics processing. Shader processor 36 may be configured to provide programmable flexibility. For instance, shader processor 36 may be configured to execute one or more shader programs that are downloaded onto GPU 12 via CPU 6. A shader program, in some examples, may be a compiled version of a program written in a high-level shading language, such as, e.g., an OpenGL Shading Language (GLSL), a High Level Shading Language (HLSL), a C for Graphics (Cg) shading language, etc.

In some examples, shader processor 36 includes a plurality of processing units that are configured to operate in parallel, e.g., as a single instruction multiple data (SIMD) pipeline. Shader processor 36 may have a program memory that stores shader program instructions, a general purpose register (GPR) that stores data that is to be processed and the resulting data, and an execution state register, e.g., a program counter register that indicates the current instruction in the program memory being executed or the next instruction to be fetched. Examples of shader programs that execute on shader processor 36 include, for example, a vertex shader, a pixel shader, a geometry shader, a hull shader, a domain shader, a compute shader, and/or a unified shader.

One or more fixed function units 38 may include hardware that is hard-wired to perform certain functions. Although the fixed function hardware may be configurable, via one or more control signals for example, to perform different functions, the fixed function hardware of one or more fixed function units 38 typically does not include a program memory that is capable of receiving user-compiled programs. In some examples, one or more fixed function units 38 include, for example, processing units that perform raster operations, such as, e.g., depth testing, scissors testing, alpha blending, etc.

GPU 12 also includes texture unit 34, which is a hardware unit of GPU 12 and is used in texturing algorithms. Texturing may include retrieving a bitmap from a texture buffer, which may be part of system memory 10, processing the bitmap, and placing this processed bitmap over a graphical object. The bitmap may be considered as a two-dimensional image that texture unit 34 processes so that shader processor 36 or one or more fixed function units 38 can place it over a graphical object. The pixels of the bitmap may be referred to as texels, and the data that texture unit 34 generates for output to shader processor 36 may be referred to as texel data.

As a simple example, the bitmap may be a flattened two-dimensional image of the world map. Texture unit 34 may process this two-dimensional image of the world map and GPU 12 (e.g., via shader processor 36 and/or fixed function units 38) may place this image over a spherical graphical object forming a graphical globe.

Although using texturing to form a graphical globe is one example, there may be various other examples of texturing. Some examples of texturing algorithms include parallax occlusion mapping (POM), screen space ray tracing (SSRT), depth of field (DoF), volume rendering, and water/terrain rendering with dynamic height fields. The examples described in this disclosure are applicable to these texturing algorithms, a subset of these texturing algorithms, texturing algorithms in addition to these examples, or any combination of the foregoing.

To perform texturing, application 24 may issue instructions to graphics API 26, and in turn to GPU driver 28. GPU driver 28 may issue instructions to shader processor 36 to execute operations that include calls to texture unit 34 instructing texture unit 34 to perform processing of data. In some examples, to perform texturing, shader processor 36 may execute a looped function (e.g., such as a “while” loop or a “for” loop) that has a condition to be satisfied (e.g., the loop continues until a bound on the iteration count of the loop is satisfied). During each iteration, shader processor 36 may output the call to texture unit 34 and receive data back from texture unit 34.

As an example, the structure of the looped function may be:

    initialize
    while (condition) {
        texOffsets
        loopBody
        sample texture
    }

To further illustrate the looped function, the following is an example of operations that shader processor 36 executes for POM rendering.

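    // Sample the height map at the starting texture coordinate.
    float height = read_imagef(heightMap, tex).x;
    float prevHeight = height;
    // Continue while the current layer is still above the sampled height.
    while (currentLayerHeight > height) {
        texOffset += dTex;  // advance the texture coordinate
        prevHeight = height;
        height = read_imagef(heightMap, tex + texOffset).x;
        currentLayerHeight -= layerHeight;
    }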

In the above example, the shader program that shader processor 36 executes causes shader processor 36 to execute an operation via the “read_imagef” function. The “read_imagef” function is used to sample the heightmap texture at location tex + texOffset, and the texOffset += dTex statement modifies the texture coordinates. As can be seen, shader processor 36 repeatedly executes the “read_imagef” function as long as currentLayerHeight is greater than height, that is, until currentLayerHeight becomes less than or equal to height. During each execution, a texel value is read (e.g., heightmap value) and assigned to a variable (e.g., height). Whether the condition is satisfied is based on a comparison of the texel value with a variable (e.g., height is compared to currentLayerHeight to determine whether currentLayerHeight is greater than height).

To perform the operations for the POM rendering, shader processor 36 may output a request to texture unit 34 (e.g., instruct texture unit 34 to execute the operation of read_imagef) and in return receive texel data for storage in a GPR. Shader processor 36 may perform the additional operations in the above code, and determine whether the condition is satisfied. If the condition is still satisfied, shader processor 36 may repeat the request to texture unit 34 and in turn receive texel data for storage in the GPR, and keep repeating these steps based on whether the condition is satisfied.
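The round trips described above can be modeled with a small host-side sketch in C. This is only an illustration under stated assumptions: the height map is a one-dimensional array with integer texel coordinates, the hypothetical function `sample_height` stands in for one request to the texture unit, and the counter `texture_calls` exists purely to show how many requests the shader-side loop issues. The `prevHeight` bookkeeping of the example above is omitted for brevity.

```c
#include <stddef.h>

/* Counts requests issued to the modeled texture unit (illustration only). */
int texture_calls = 0;

/* Hypothetical stand-in for one request to the texture unit: returns the
 * height stored at an integer texel index, clamped to the map bounds. */
float sample_height(const float *height_map, size_t size, int index)
{
    texture_calls++;                      /* one round trip per request */
    if (index < 0) index = 0;
    if ((size_t)index >= size) index = (int)size - 1;
    return height_map[index];
}

/* Shader-side POM loop: every iteration issues a new texture request. */
float pom_shader_loop(const float *height_map, size_t size,
                      int tex, int d_tex,
                      float current_layer_height, float layer_height)
{
    int tex_offset = 0;
    float height = sample_height(height_map, size, tex);
    while (current_layer_height > height) {
        tex_offset += d_tex;
        height = sample_height(height_map, size, tex + tex_offset);
        current_layer_height -= layer_height;
    }
    return height;
}
```

In this model, each pass through the loop costs one additional request, so the number of requests grows with the number of layers stepped through before the condition fails.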

In some examples, shader processor 36 may loop through the operations until a condition is satisfied (e.g., loop until an upper bound is reached) or may loop through the operations as long as a condition is satisfied (e.g., as long as a first value is less than a second value). In these examples, shader processor 36 may loop through the operations based on a condition being satisfied (e.g., as long as a condition is satisfied) or not being satisfied (e.g., until a condition is satisfied).

The condition being satisfied may be part of a “while loop,” whereas the condition not being satisfied may be part of a “do loop.” For instance, the condition may be while A<B perform a set of operations. In this case, texture unit 34 may repeatedly execute the operations based on the condition being satisfied (e.g., if A is less than B, texture unit 34 will execute another iteration of the operations). As another example, the condition may be to repeat until A≧B. In this case, texture unit 34 may repeatedly execute the operations based on the condition not being satisfied (e.g., if A is not equal to or greater than B, texture unit 34 will execute another iteration of the operations). The techniques described in this disclosure are applicable to both cases (e.g., repeatedly executing based on the condition being satisfied and based on the condition not being satisfied, which is a function of how the loop is defined). For ease, the description may refer to the case where texture unit 34 repeatedly executes based on the condition being satisfied, but such description should not be read to mean that the techniques are not applicable to the case where texture unit 34 repeatedly executes based on the condition not being satisfied.
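The two loop forms differ only in the polarity of the test, which a short C sketch can make explicit. The enum `loop_mode` and the function names below are hypothetical and are not taken from this disclosure; for simplicity the sketch tests the condition before every iteration in both modes, and it assumes a positive `step` so that the loop terminates.

```c
#include <stdbool.h>

/* Hypothetical encoding of how a loop instruction might state its condition. */
typedef enum { RUN_WHILE_TRUE, RUN_UNTIL_TRUE } loop_mode;

/* Returns true when another iteration should run, given the evaluated
 * condition and the mode carried in the instruction. */
bool needs_iteration(loop_mode mode, bool condition)
{
    return (mode == RUN_WHILE_TRUE) ? condition : !condition;
}

/* Counts iterations of "a += step" under either loop form.
 * The while form tests A < B; the until form tests A >= B. */
int count_iterations(loop_mode mode, int a, int b, int step)
{
    int iterations = 0;
    for (;;) {
        bool condition = (mode == RUN_WHILE_TRUE) ? (a < b) : (a >= b);
        if (!needs_iteration(mode, condition))
            break;
        a += step;
        iterations++;
    }
    return iterations;
}
```

With the same starting values, "while A<B" and "repeat until A≧B" run the same number of iterations in this model, which is why the description can focus on one form without loss of generality.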

The repeated calls by shader processor 36 to texture unit 34 may increase the workload of shader processor 36, require shader processor 36 to include a relatively large GPR that is unavailable for other purposes while the loop is being executed, as well as use a larger instruction cache in local memory 14 to store all of the operations of the loop. In the techniques described in this disclosure, rather than shader processor 36 repeatedly executing the operations that involve access to texture unit 34, texture unit 34 may be configured to repeatedly execute operations in response to an access from shader processor 36 so that shader processor 36 does not need to repeatedly access texture unit 34.

For example, texture unit 34 may be configured to repeatedly execute a plurality of operations in response to a single access by shader processor 36. At least some of such operations conventionally would be performed in response to each of a plurality of accesses by shader processor 36, e.g., one operation in response to one access. In contrast, in accordance with various examples of this disclosure, texture unit 34 may execute multiple operations in response to a given access to texture unit 34 by shader processor 36.

By reducing the number of accesses to texture unit 34 by shader processor 36 to accomplish a set of operations, the workload of shader processor 36 may be reduced. Furthermore, in some cases, texture unit 34 may be capable of performing condition testing (e.g., condition check), mathematical operations in the loop, and other such functions with higher throughput and utilizing less power than shader processor 36.

For instance, example shader code (e.g., operations of a shader program) of the POM rendering may be one single instruction that shader processor 36 outputs to texture unit 34 (e.g., one instance of shader processor 36 accessing texture unit 34). As an example, the instruction that shader processor 36 may output to texture unit 34 may be:

float4 result = textureLoop(heightMap, tex, dTex, layerHeight, condition ...);

In the instruction that shader processor 36 outputs to texture unit 34, shader processor 36 includes variables of the operations that texture unit 34 is to perform (e.g., heightMap, tex, dTex, layerHeight) as well as a definition of the condition. In turn, texture unit 34 may repeatedly execute the operations based on the condition defined in the instruction received from shader processor 36 being satisfied.
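As a rough software model of such a single instruction, the following C sketch runs the whole POM loop inside one function call. It is not the hardware interface: the function `texture_loop`, its exact parameter list (including the explicit `current_layer_height` argument), the `loop_condition` function-pointer type, and the one-dimensional height map are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical form of the condition carried by the instruction. */
typedef bool (*loop_condition)(float current_layer_height, float height);

typedef struct {
    float height;      /* last height read from the height map */
    int   tex_offset;  /* accumulated texture-coordinate offset */
    int   iterations;  /* loop iterations run inside the modeled unit */
} texture_loop_result;

/* Reads one texel from a 1D height map, clamping the index to its bounds. */
static float fetch_texel(const float *height_map, size_t size, int index)
{
    if (index < 0) index = 0;
    if ((size_t)index >= size) index = (int)size - 1;
    return height_map[index];
}

/* Example condition: keep looping while the layer is above the surface. */
bool layer_above_surface(float current_layer_height, float height)
{
    return current_layer_height > height;
}

/* One call from the caller; all iterations happen inside this function. */
texture_loop_result texture_loop(const float *height_map, size_t size,
                                 int tex, int d_tex, float layer_height,
                                 float current_layer_height,
                                 loop_condition condition)
{
    texture_loop_result r = { fetch_texel(height_map, size, tex), 0, 0 };
    while (condition(current_layer_height, r.height)) {
        r.tex_offset += d_tex;
        r.height = fetch_texel(height_map, size, tex + r.tex_offset);
        current_layer_height -= layer_height;
        r.iterations++;
    }
    return r;
}
```

The caller makes exactly one call and receives only the final values, regardless of how many iterations were needed.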

For texture unit 34 to repeatedly execute the operations, texture unit 34 may need to be configured to recognize that a single function call instructs texture unit 34 to perform a particular set of operations. For instance, during the design of texture unit 34, texture unit 34 may be designed such that if texture unit 34 receives a function having a particular name or receives a function having a particular set of variables or order of variables, then texture unit 34 is to repeatedly execute a particular set of operations. As an example, if texture unit 34 receives an instruction including the textureLoop function, then texture unit 34 may determine that texture unit 34 is to repeatedly execute operations such as those described above as being executed by shader processor 36. If texture unit 34 receives an instruction including a different function (e.g., one for SSRT), then texture unit 34 may determine that texture unit 34 is to repeatedly execute the operations for that function that would have otherwise been executed by shader processor 36.
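One way to picture this recognition step is as a lookup from the received function name to a preconfigured loop program. The C sketch below is only a software analogy: the enum `loop_program`, the function `select_loop_program`, and the name "textureLoopSSRT" are hypothetical; only the name textureLoop appears in this disclosure.

```c
#include <string.h>

/* Hypothetical identifiers for loop programs the unit is configured to run. */
typedef enum { LOOP_NONE, LOOP_POM, LOOP_SSRT } loop_program;

/* Maps a function name in the received instruction to a loop program.
 * LOOP_NONE means the instruction is handled as an ordinary texture request. */
loop_program select_loop_program(const char *function_name)
{
    if (strcmp(function_name, "textureLoop") == 0)
        return LOOP_POM;
    if (strcmp(function_name, "textureLoopSSRT") == 0)  /* assumed name */
        return LOOP_SSRT;
    return LOOP_NONE;
}
```

Actual hardware would more likely decode an opcode or operand pattern than compare strings; the string comparison simply keeps the sketch readable.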

Texture unit 34 may be pre-configured to repeatedly execute operations for different types of texturing, and as more texturing algorithms are developed, texture unit 34 may be configured to repeatedly execute operations for these texturing algorithms as well. More generally, although the examples are described as being for texturing algorithms, the techniques described in this disclosure are not so limited. For instance, the techniques described in this disclosure may be extended to other cases where loop operations are used that require access to texture unit 34, even if the loop operations are not being used for texturing purposes. In this way, texture unit 34 may be configured in a manner that may be more closely comparable to a programmable texture processing unit.

The developer guide for GPU 12 may include information indicating which looped-operations texture unit 34 is configured to perform and the instruction for the function call to have texture unit 34 perform the operations. During development of the shader program used to modify texture coordinates, the developer may include the instruction for the function call rather than looped-operations in the code of the shader program.

As another example, rather than relying on the developer to exploit the ability of texture unit 34 to execute the looped-operations, a compiler, executing on CPU 6, that compiles the shader program may compile the looped-operation into a single instruction that includes the particular function call to texture unit 34. Alternatively or in addition, GPU driver 28 or a wrapper for GPU driver 28 may be configured to read the high-level language of the code of the shader program and determine places in the code that include particular looped-operations that texture unit 34 is configured to execute. GPU driver 28 or the wrapper for GPU driver 28 may modify the code of the shader program to include the single instruction with the particular function call to have texture unit 34 execute the looped-operations.

Accordingly, texture unit 34 may be configured to receive an instruction outputted by shader processor 36 instructing texture unit 34 to repeatedly execute operations based on a condition defined in the instruction being satisfied (or not being satisfied). The operations may be operations of a shader program and include operations to modify texture coordinates.

Texture unit 34 may repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied (e.g., as long as the condition is satisfied or until the condition is satisfied) and repeatedly execute without receiving any additional instructions to execute the operations from shader processor 36. In this way, the workload of shader processor 36 and the frequency of interaction between shader processor 36 and texture unit 34 may be reduced.

Texture unit 34 may output data resulting from the repeated execution of the operations. For example, texture unit 34 may output the data to shader processor 36 only after all iterations of the repeated execution of the operations are complete. In other words, texture unit 34 may not output the data resulting from the repeated execution until after all iterations of the loop are complete. Accordingly, the number of times texture unit 34 needs to output to shader processor 36 may also be limited. However, in some examples, texture unit 34 may output data resulting from the execution to shader processor 36 periodically or at the conclusion of one iteration. Therefore, examples of the output of data resulting from the repeated execution include outputting the final data after all iterations are complete or outputting data periodically during the repeated execution.

FIG. 3 is a block diagram illustrating an example of a texture unit of FIG. 2 in further detail. In FIG. 3, shader processor 36 may output one instruction to texture unit 34 instructing texture unit 34 to repeatedly execute operations based on a condition defined in the instruction being satisfied or not satisfied. For instance, the instruction that shader processor 36 outputs to texture unit 34 includes the variables on which texture unit 34 operates and the condition that defines when the looped operations are complete.

As illustrated, texture unit 34 includes input unit 40, cache 42 (which may be a local cache of texture unit 34 or part of local memory 14), formatting unit 44, filter unit 46, color format unit 48, and output unit 50. In the example texture unit 34, output unit 50 and input unit 40 are connected to one another via feedback signal 52. As described in more detail below, feedback signal 52 provides the mechanism by which output unit 50 triggers another iteration based on whether the condition is satisfied or not satisfied.

The units of texture unit 34 illustrated in FIG. 3 are illustrated to ease understanding. Different types of texture unit 34 may include more, fewer, or different units than those illustrated, and the interconnection between the components need not necessarily be as illustrated. The techniques described in this disclosure are also applicable to such examples of texture unit 34.

In normal operation (e.g., where texture unit 34 is not repurposed to perform operations generally performed by shader processor 36), input unit 40 may be used for addressing purposes. Input unit 40 may convert (u,v) coordinates into memory addresses. Cache 42 may store the information addressed by input unit 40. Formatting unit 44 may perform various formatting on the bitmap as defined by the texturing algorithm. Filter unit 46 may perform bilinear filtering/interpolation. Color format unit 48 may format the color of the bitmap. Output unit 50 receives the output from color format unit 48 and is the output interface to shader processor 36 to output the texel data.

However, in the example techniques described in this disclosure, these various units may be repurposed to repeatedly execute operations of a shader program. For instance, as described above, the structure of the looped-operations may be as follows:

initialize
while (condition) {
    texOffsets
    loopBody
    sample texture
}

In some examples, input unit 40 may be configured to perform the initialize and texOffsets operations. Filter unit 46 may be configured to perform the operation of the condition. Color format unit 48 may be configured to perform the operation of the loopBody. Output unit 50 may be configured to determine whether the condition is satisfied.

For example, output unit 50 may be configured to determine whether an iteration of execution of the operations is needed based on whether the condition defined in the instruction is satisfied or not satisfied. In the example of a while-loop, output unit 50 may determine whether the condition to be satisfied is still true. If the condition to be satisfied is still true, output unit 50 may determine that the iteration of execution of the operations is needed (i.e., another pass through the loop). In this case, to repeatedly execute the operations, output unit 50 is configured to output, from texture unit 34 (e.g., output data resulting from one iteration of the loop), feedback signal 52 to input unit 40 based on the determination that the iteration of execution of the operations is needed. If the condition to be satisfied is false, output unit 50 may determine that the iteration of execution of operations is no longer needed (i.e., the loop is complete). In this case, output unit 50 outputs the data resulting from the repeated execution of the operations based on the determination that the iteration of execution of the operations is not needed. In some examples, input unit 40 may be configured to give feedback signal 52 higher priority than any output from shader processor 36.

In some examples, during each iteration of execution of the loop, texture unit 34 may read a texel value. For instance, input unit 40 or formatting unit 44 may be configured to read texel values from a texture buffer (e.g., located in local memory 14 or possibly system memory 10) or from cache 42, and may read a texel value during each iteration. Output unit 50 may compare this read texel value (or a processed version of the read texel value) with a variable defined in the instruction to determine whether another round of iteration is needed. In these examples, the read texel value controls whether more iterations of execution are needed based on the comparison to the variable defined in the instruction.

For instance, a read unit (e.g., input unit 40 and/or formatting unit 44, or possibly some other unit of texture unit 34) may read a texel value during a first iteration of execution of the operations. Output unit 50 may determine whether the condition is satisfied based on a comparison of a value based on the texel value (e.g., the texel value itself or a value determined from processing the texel value) with a variable defined in the instruction. Output unit 50 may determine whether a second iteration of execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied.

In one example, to repeatedly execute the operations, output unit 50 may output a feedback signal to input unit 40 based on the determination that the second iteration of execution of the operations is needed. In another example, output unit 50 may output the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.

As an illustration, for the POM algorithm above, the condition of the while loop was (currentLayerHeight > height) and the loopBody was height = read_imagef(heightMap, tex + texOffset). In this example, a read unit (e.g., input unit 40 or formatting unit 44, as two non-limiting examples) may read a texel value (e.g., the value of heightMap stored at tex + texOffset) during a first execution of the operations.

After one iteration of execution of the operations of the while loop, output unit 50 may determine whether currentLayerHeight is still greater than height. For example, output unit 50 may determine whether the condition is satisfied based on a comparison of a value based on the texel value (the texel value itself in this case, which is the value of height) with a variable defined in the instruction (e.g., currentLayerHeight). Output unit 50 may determine whether a second iteration of the execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied.

For example, if true (e.g., currentLayerHeight is still greater than height), output unit 50 may output the value of height as previously calculated back to input unit 40 as feedback signal 52 so that the units of texture unit 34 execute an iteration of the operations. Accordingly, to repeatedly execute the operations, output unit 50 may be configured to output feedback signal 52 to input unit 40 based on the determination that the second iteration of execution of the operations is needed. This process repeats until the condition is no longer true.

If false (e.g., currentLayerHeight is no longer greater than height), output unit 50 may output the final value of height to shader processor 36 as determined via the repeated execution of the operations. Accordingly, output unit 50 may output the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.

In this example, the first iteration and second iteration are used as a way to assist with understanding. There may be multiple iterations, and after each iteration, output unit 50 may determine whether another iteration is needed based on a comparison between a value based on the read texel value (e.g., the texel value itself or a processed texel value) and a variable defined in the instruction. If another iteration is needed, output unit 50 may output feedback signal 52 to input unit 40, and if another iteration is not needed, output unit 50 may output to GPU 12.
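The feedback path walked through above can be sketched as a small software model in C, with one function per unit. This is an illustrative model only, not the circuit: the `loop_state` structure, the function names, the one-dimensional height map, and the `is_feedback` flag (standing in for feedback signal 52) are assumptions introduced for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* State carried from the output unit back to the input unit (assumed layout). */
typedef struct {
    int   tex_offset;
    float height;
    float current_layer_height;
    int   passes;               /* passes through the modeled pipeline */
} loop_state;

/* Input unit: initializes on the first pass, applies texOffsets on feedback. */
void input_unit(loop_state *s, int d_tex, bool is_feedback)
{
    if (is_feedback) s->tex_offset += d_tex;
    else             s->tex_offset = 0;
}

/* Read unit: reads one texel from a 1D height map with clamped indexing. */
void read_unit(loop_state *s, const float *map, size_t size, int tex)
{
    int index = tex + s->tex_offset;
    if (index < 0) index = 0;
    if ((size_t)index >= size) index = (int)size - 1;
    s->height = map[index];
}

/* Loop-body unit: steps down one layer on each fed-back pass. */
void loop_body_unit(loop_state *s, float layer_height, bool is_feedback)
{
    if (is_feedback) s->current_layer_height -= layer_height;
}

/* Output unit: true means "assert the feedback signal", false means "done". */
bool output_unit(const loop_state *s)
{
    return s->current_layer_height > s->height;
}

float run_texture_unit(const float *map, size_t size, int tex, int d_tex,
                       float start_layer_height, float layer_height,
                       int *passes)
{
    loop_state s = { 0, 0.0f, start_layer_height, 0 };
    bool feedback = false;
    do {
        input_unit(&s, d_tex, feedback);
        read_unit(&s, map, size, tex);
        loop_body_unit(&s, layer_height, feedback);
        s.passes++;
        feedback = output_unit(&s);   /* compare texel value to variable */
    } while (feedback);
    *passes = s.passes;
    return s.height;                  /* final data for the shader processor */
}
```

The first pass corresponds to the initial sample, and each later pass is triggered by the modeled feedback signal, so the final height matches what the shader-side loop would have computed.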

The above provided an example using POM. The following provides some additional example uses including another example of using POM. Furthermore, the above example of POM provided one example of which units of texture unit 34 perform which operations. However, the techniques described in this disclosure are not so limited, and units of texture unit 34 may perform different operations than the example of POM provided above. In some examples, the operations include operations to modify texture coordinates (e.g., texOffset). Also, the techniques described in this disclosure should not be considered limited to texturing, and may be used for other purposes such as ray tracing and other examples. In some examples, output unit 50 may periodically output data to shader processor 36 rather than only after completion of all iterations of execution of the operations.

As another example of POM, in the loop structure, the initialize operation is currHeight=1, the condition is (currHeight>height AND currHeight>0), the texOffsets operation is texCoord+=dTex, and the loopBody is currHeight−=layerHeight. In this example, input unit 40 may set currHeight=1 and perform the operation of texCoord+=dTex. Formatting unit 44 may assign a true or false value based on whether the condition of (currHeight>height AND currHeight>0) is satisfied. Filter unit 46 may perform the operation of currHeight−=layerHeight.
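A brief C sketch of this second POM loop shows the role of the extra currHeight>0 test, which bounds the number of iterations even when the sampled heights never reach the current layer. The function name, the one-dimensional height map, and the initial sample taken before the first condition check are assumptions of the sketch rather than details stated in this disclosure.

```c
#include <stddef.h>

/* Reads one texel from a 1D height map, clamping the index to its bounds. */
static float fetch_texel(const float *height_map, size_t size, int index)
{
    if (index < 0) index = 0;
    if ((size_t)index >= size) index = (int)size - 1;
    return height_map[index];
}

/* Returns the number of loop iterations; writes the last height read. */
int pom_guarded_loop(const float *height_map, size_t size,
                     int tex_coord, int d_tex, float layer_height,
                     float *final_height)
{
    float curr_height = 1.0f;                                  /* initialize */
    float height = fetch_texel(height_map, size, tex_coord);   /* assumed initial sample */
    int iterations = 0;

    while (curr_height > height && curr_height > 0.0f) {       /* condition */
        tex_coord += d_tex;                                    /* texOffsets */
        curr_height -= layer_height;                           /* loopBody */
        height = fetch_texel(height_map, size, tex_coord);     /* sample texture */
        iterations++;
    }
    *final_height = height;
    return iterations;
}
```

With a layer height of 0.25 and a height map that is everywhere below zero, the loop in this sketch stops after four iterations because currHeight reaches zero, not because a surface was found.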

Output unit 50 may determine whether an iteration of execution of the operations is needed based on whether the condition defined in the instruction is satisfied (e.g., based on the true or false determination made by formatting unit 44). In this example, to repeatedly execute the operations, output unit 50 outputs from texture unit 34 (e.g., data from one iteration of the operations) feedback signal 52 to input unit 40 based on the determination that the iteration of execution of the operations is needed. Otherwise, to output data, output unit 50 outputs the data resulting from the repeated execution of the operations based on the determination that the iteration of execution of operations is not needed. In each case, whether another iteration of the loop is needed may be based on a comparison of the texel value (or a value determined from the texel value) and a variable defined in the instruction.

As an example of screen space ray tracing (SSRT), there may be no initialize operation. The condition is (P.x*stepDir<=endP AND stepCount<maxSteps), so that the loop continues only while the step count remains below its bound. The texOffsets operation is (P, Q.z, k)+=(dP, dQ.z, dK). The loopBody operation is rayZmax=(dQ.z*0.5+Q.z)/(dK*0.5+k). Similar to above, in this example, input unit 40 may perform the operation of (P, Q.z, k)+=(dP, dQ.z, dK). Formatting unit 44 may assign a true or false value based on whether the condition of P.x*stepDir<=endP AND stepCount<maxSteps is satisfied. Filter unit 46 may perform the operation of rayZmax=(dQ.z*0.5+Q.z)/(dK*0.5+k).

Output unit 50 may determine whether an iteration of execution of operations is needed based on whether the condition defined in the instruction is satisfied. If true, output unit 50 outputs feedback signal 52 to input unit 40 for another execution iteration. If false, output unit 50 outputs the final data resulting from repeated execution by texture unit 34 to shader processor 36.
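The SSRT loop structure can likewise be sketched in C. The sketch assumes that the step counter starts at zero, is incremented once per iteration, and is bounded above by maxSteps; it tracks only the x component of P; and it omits the depth-buffer comparison that a full screen space ray tracer would perform. The function and structure names are hypothetical, and the divisor dK*0.5+k is assumed to be nonzero.

```c
/* Result of the modeled march: steps taken and the last rayZmax computed. */
typedef struct {
    int   steps;
    float ray_z_max;
} ssrt_result;

ssrt_result ssrt_march(float px, float dpx,   /* P.x and dP.x */
                       float qz, float dqz,   /* Q.z and dQ.z */
                       float k,  float dk,    /* k and dK */
                       float step_dir, float end_p, int max_steps)
{
    ssrt_result r = { 0, 0.0f };
    /* condition: still before the end point and below the step bound */
    while (px * step_dir <= end_p && r.steps < max_steps) {
        /* texOffsets: advance the ray by one step */
        px += dpx;
        qz += dqz;
        k  += dk;
        /* loopBody: depth of the ray at the far side of this step */
        r.ray_z_max = (dqz * 0.5f + qz) / (dk * 0.5f + k);
        r.steps++;
    }
    return r;
}
```

Either test can end the loop: the ray reaching the end point, or the step count reaching its bound.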

The example techniques described in this disclosure may also be applicable to tree traversal algorithms. For example, in bounding volume hierarchy (BVH) tree traversal for ray tracing, texture unit 34 may be configured for the ray-box intersection test and to traverse the tree using many execution iterations of operations in a loop. For quad tree traversal (e.g., screen-space ray tracing, depth of field, volume rendering, view synthesis, etc.), a developer may build quad trees on top of depth buffers. To traverse the trees, texture unit 34 may be configured to execute loop operations in addition to ray-box intersection test and ray-plane intersection test. The ray-plane intersection test is a simplified ray-box intersection test.
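For reference, a ray-box intersection test is commonly written with the slab method, sketched below in C. The disclosure does not specify which formulation texture unit 34 would use, so this is a generic illustration; the `aabb` structure, the function name, and the `t_max` ray-length parameter are assumptions.

```c
#include <stdbool.h>

/* Axis-aligned bounding box given by its minimum and maximum corners. */
typedef struct {
    float min[3];
    float max[3];
} aabb;

/* Slab test: returns true if the ray origin + t*dir, 0 <= t <= t_max,
 * passes through the box. */
bool ray_box_intersect(const float origin[3], const float dir[3],
                       const aabb *box, float t_max)
{
    float t_near = 0.0f;
    float t_far = t_max;

    for (int axis = 0; axis < 3; axis++) {
        if (dir[axis] == 0.0f) {
            /* Ray is parallel to this slab: it must start inside it. */
            if (origin[axis] < box->min[axis] || origin[axis] > box->max[axis])
                return false;
            continue;
        }
        float t0 = (box->min[axis] - origin[axis]) / dir[axis];
        float t1 = (box->max[axis] - origin[axis]) / dir[axis];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > t_near) t_near = t0;   /* latest entry across slabs */
        if (t1 < t_far)  t_far = t1;    /* earliest exit across slabs */
        if (t_near > t_far)
            return false;               /* intervals do not overlap */
    }
    return true;
}
```

Restricting the loop to a single axis gives a test against one pair of parallel planes, which is one way to read the statement that the ray-plane test is a simplified ray-box test.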

FIG. 4 is a flowchart illustrating an example method of processing data in accordance with one or more example techniques described in this disclosure. Texture unit 34 receives an instruction from shader processor 36 of GPU 12 instructing texture unit 34 to repeatedly execute operations based on a condition defined in the instruction being satisfied (54). The operations may be operations of a shader program and operations to modify texture coordinates. Examples of the operations include operations for POM, SSRT, DoF processing, volume rendering, or water or terrain rendering with dynamic height fields.

Texture unit 34 repeatedly executes the operations based on the condition defined in the instruction being satisfied or not satisfied (56). For example, texture unit 34 repeatedly executes operations until the condition is satisfied (e.g., repeatedly executes if the condition is not satisfied) or as long as the condition is satisfied (e.g., repeatedly executes if the condition is satisfied). Also, texture unit 34 executes operations based on the condition defined in the instruction being satisfied or not being satisfied without receiving any additional instructions to execute the operations.

Texture unit 34 outputs data resulting from the repeated execution of the operations to shader processor 36 (58). In one example, texture unit 34 outputs the data to shader processor 36 only after all iterations of the repeated execution of the operations are complete.

In some examples, output unit 50 of texture unit 34 may be configured to determine whether an iteration of execution of the operations is needed based on whether the condition defined in the instruction is satisfied or not satisfied. Output unit 50 may be configured to output from texture unit 34 feedback signal 52 to input unit 40 based on the determination that the iteration of execution of the operations is needed. Otherwise, output unit 50 may be configured to output the data resulting from the repeated execution of the operations based on the determination that the iteration of execution of the operations is not needed. In some examples, in determining whether to execute another iteration, output unit 50 may compare the read texel value or a value based on the texel value to a variable defined in the instruction.

The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. The term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry such as discrete hardware that performs processing.

Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure. In addition, any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware, firmware, and/or software components, or integrated within common or separate hardware or software components.

The techniques described in this disclosure may also be stored, embodied or encoded in a computer-readable medium, such as a computer-readable storage medium that stores instructions. Instructions embedded or encoded in a computer-readable medium may cause one or more processors to perform the techniques described herein, e.g., when the instructions are executed by the one or more processors. Computer readable storage media may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer readable storage media that is tangible.

Various aspects and examples have been described. However, modifications can be made to the structure or techniques of this disclosure without departing from the scope of the following claims.

What is claimed is:
 1. A method of processing data, the method comprising: receiving, with a texture unit, an instruction instructing the texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied; repeatedly executing, with the texture unit, the operations based on the condition defined in the instruction being satisfied or not being satisfied; and outputting, with the texture unit and to a graphics processing unit (GPU), data resulting from the repeated execution of the operations.
 2. The method of claim 1, wherein receiving the instruction comprises receiving the instruction from a shader processor of the GPU, and wherein outputting comprises outputting the data to the shader processor of the GPU.
 3. The method of claim 1, further comprising: reading, with the texture unit, a texel value during a first iteration of the repeated execution of the operations; determining, with the texture unit, whether the condition is satisfied or not satisfied by comparing a value based on the texel value with a variable defined in the instruction; determining, with the texture unit, whether a second iteration of execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied or not satisfied, wherein repeatedly executing the operations comprises outputting, with the texture unit, an output of the texture unit as a feedback signal to an input of the texture unit based on the determination that the second iteration of execution of the operations is needed, and wherein outputting data comprises outputting the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.
 4. The method of claim 1, wherein repeatedly executing the operations comprises repeatedly executing the operations based on the condition defined in the instruction being satisfied or not being satisfied without receiving any additional instructions to execute the operations.
 5. The method of claim 1, wherein the operations comprise operations of a shader program.
 6. The method of claim 1, wherein the operations comprise operations to modify texture coordinates.
 7. The method of claim 1, wherein the operations comprise operations for one or more of parallax occlusion mapping (POM), screen space ray tracing (SSRT), depth of field (DoF) processing, volume rendering, or water or terrain rendering with dynamic height fields.
 8. The method of claim 1, wherein repeatedly executing the operations comprises repeatedly executing the operations until the condition is satisfied or as long as the condition is satisfied.
 9. The method of claim 1, wherein outputting the data resulting from the repeated execution of the operations comprises outputting the data to a shader processor only after all iterations of the repeated execution of the operations are complete.
 10. A device for processing data, the device comprising: a graphics processing unit (GPU) comprising a shader processor; and a texture unit configured to: receive, from the shader processor of the GPU, an instruction instructing the texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied; repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied; and output, to the GPU, data resulting from the repeated execution of the operations.
 11. The device of claim 10, wherein the texture unit is configured to output the data resulting from the repeated execution of the operations to the shader processor of the GPU.
 12. The device of claim 10, wherein the texture unit comprises: an input unit; a read unit configured to read a texel value during a first iteration of the repeated execution of the operations; and an output unit configured to: determine whether the condition is satisfied or not satisfied by comparing a value based on the texel value with a variable defined in the instruction; determine whether a second iteration of execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied or not satisfied, wherein to repeatedly execute the operations, the output unit is configured to output a feedback signal to the input unit of the texture unit based on the determination that the second iteration of execution of the operations is needed, and wherein to output data, the output unit is configured to output the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.
 13. The device of claim 10, wherein the texture unit is configured to repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied without receiving any additional instructions to execute the operations.
 14. The device of claim 10, wherein the operations comprise operations of a shader program.
 15. The device of claim 10, wherein the operations comprise operations to modify texture coordinates.
 16. The device of claim 10, wherein the operations comprise operations for one or more of parallax occlusion mapping (POM), screen space ray tracing (SSRT), depth of field (DoF) processing, volume rendering, or water or terrain rendering with dynamic height fields.
 17. The device of claim 10, wherein the texture unit is configured to repeatedly execute the operations until the condition is satisfied or as long as the condition is satisfied.
 18. The device of claim 10, wherein the texture unit is configured to output the data resulting from the repeated execution of the operations to the shader processor of the GPU only after all iterations of the repeated execution of the operations are complete.
 19. The device of claim 10, wherein the device comprises one of: an integrated circuit; a microprocessor; or a wireless communication device.
 20. The device of claim 10, wherein the GPU comprises the texture unit.
 21. A device for processing data, the device comprising: means for receiving an instruction instructing a texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied; means for repeatedly executing the operations based on the condition defined in the instruction being satisfied or not being satisfied; and means for outputting, to a graphics processing unit (GPU), data resulting from the repeated execution of the operations.
 22. The device of claim 21, further comprising: means for reading a texel value during a first iteration of the repeated execution of the operations; means for determining whether the condition is satisfied or not satisfied by comparing a value based on the texel value with a variable defined in the instruction; means for determining whether a second iteration of execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied or not satisfied, wherein the means for repeatedly executing the operations comprises means for outputting an output of the texture unit as a feedback signal to an input of the texture unit based on the determination that the second iteration of execution of the operations is needed, and wherein the means for outputting data comprises means for outputting the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.
 23. The device of claim 21, wherein the means for repeatedly executing the operations comprises means for repeatedly executing the operations based on the condition defined in the instruction being satisfied or not being satisfied without receiving any additional instructions to execute the operations.
 24. The device of claim 21, wherein the means for outputting the data resulting from the repeated execution of the operations comprises means for outputting the data to a shader processor only after all iterations of the repeated execution of the operations are complete.
 25. A non-transitory computer-readable storage medium storing instructions that when executed cause one or more processors of a device for processing data to: receive an instruction instructing a texture unit to repeatedly execute operations based on a condition defined in the instruction being satisfied; repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied; and output, to a graphics processing unit (GPU), data resulting from the repeated execution of the operations.
 26. The non-transitory computer-readable storage medium of claim 25, further comprising instructions that cause the one or more processors to: read a texel value during a first iteration of the repeated execution of the operations; determine whether the condition is satisfied or not satisfied by comparing a value based on the texel value with a variable defined in the instruction; determine whether a second iteration of execution of the operations is needed based on the determination of whether the condition defined in the instruction is satisfied or not satisfied, wherein the instructions that cause the one or more processors to repeatedly execute the operations comprise instructions that cause the one or more processors to output an output of the texture unit as a feedback signal to an input of the texture unit based on the determination that the second iteration of execution of the operations is needed, and wherein the instructions that cause the one or more processors to output data comprise instructions that cause the one or more processors to output the data resulting from the repeated execution of the operations based on the determination that the second iteration of execution of the operations is not needed.
 27. The non-transitory computer-readable storage medium of claim 25, wherein the instructions that cause the one or more processors to repeatedly execute the operations comprise instructions that cause the one or more processors to repeatedly execute the operations based on the condition defined in the instruction being satisfied or not being satisfied without receiving any additional instructions to execute the operations.
 28. The non-transitory computer-readable storage medium of claim 25, wherein the instructions that cause the one or more processors to output the data resulting from the repeated execution of the operations comprise instructions that cause the one or more processors to output the data to a shader processor only after all iterations of the repeated execution of the operations are complete.
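
For illustration only, the feedback behavior recited in the claims (read a texel value, compare a value based on it with a variable defined in the instruction, and either feed the output back to the input or output the result) can be modeled in software. The following is a minimal sketch under stated assumptions, not the disclosed hardware: the names `LoopInstruction`, `TextureUnit`, and `execute`, the one-dimensional texture, the "texel value at or above threshold" condition, and the `max_iterations` safety bound are all hypothetical and do not appear in the claims.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LoopInstruction:
    """Hypothetical single instruction that carries the loop condition."""
    start: int           # initial texture coordinate (1-D for brevity)
    step: int            # coordinate modification applied on each feedback pass
    threshold: float     # stands in for the "variable defined in the instruction"
    max_iterations: int  # safety bound; an assumption, not recited in the claims


class TextureUnit:
    """Software model of a texture unit with an output-to-input feedback path."""

    def __init__(self, texels: List[float]) -> None:
        self.texels = texels
        self.reads = 0  # counts texel reads performed inside the unit

    def _clamp(self, coord: int) -> int:
        # Clamp-to-edge addressing keeps every read inside the texture.
        return max(0, min(coord, len(self.texels) - 1))

    def _read(self, coord: int) -> float:
        # Read unit: fetch one texel value.
        self.reads += 1
        return self.texels[coord]

    def execute(self, instr: LoopInstruction) -> Tuple[int, float, int]:
        """Run all iterations internally and return one final result."""
        coord = self._clamp(instr.start)
        texel = self._read(coord)  # first iteration
        iterations = 1
        # Output unit: compare the texel value with the instruction variable.
        # While the condition is not satisfied, feed the modified coordinate
        # back to the input instead of returning to the caller.
        while texel < instr.threshold and iterations < instr.max_iterations:
            coord = self._clamp(coord + instr.step)
            texel = self._read(coord)
            iterations += 1
        return coord, texel, iterations


if __name__ == "__main__":
    unit = TextureUnit([0.1, 0.2, 0.4, 0.7, 0.9])
    # One call stands in for the single instruction issued to the texture unit.
    print(unit.execute(LoopInstruction(start=0, step=1, threshold=0.5, max_iterations=16)))
```

In this model the caller, standing in for the shader processor, issues one instruction and receives one result after all iterations are complete, rather than issuing a separate texture access for each iteration.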